Trees





Kerry Back

A decision tree

Prediction in each cell is the plurality class (for classification) or the cell mean (for regression).

Another example

Splitting criterion for classification

  • In each cell, prediction is class with most representation.
  • Each observation of other classes is an error.
  • Try to create “pure” classes.
  • Perfect purity means each cell contains only one class
    \(\Rightarrow\) no errors.
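Purity is commonly measured by Gini impurity, which is zero for a pure cell and grows as classes mix; splits are chosen to reduce the (weighted) impurity of the child cells. A minimal sketch (the function name is illustrative, not from the lecture):

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    # 0 for a pure cell; larger when classes are mixed.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

gini([1, 1, 1, 1])   # pure cell -> 0.0
gini([0, 1, 0, 1])   # evenly mixed two classes -> 0.5
```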

Splitting criterion for regression

  • In each cell, prediction is mean.
  • Usually try to minimize sum of squared errors.
  • Algorithm will try to find splits that separate outliers into their own cells.
  • To avoid dependence on outliers,
    • Minimize sum of absolute errors instead, or
    • Choose a target variable that does not have outliers (e.g., percentile ranks).

Data

  • Monthly data in SQL database
  • 100+ predictors described in ghz-predictors.xlsx

SQL

  • select [columns or operations on columns] from [table]
  • join [another table] on [variables to match on]
  • where [select rows based on conditions]
  • order by [columns to sort on]
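The clauses above combine into a single query in that order. A self-contained illustration using an in-memory SQLite database (the course data lives in the SQL Server database described below; the table and column names here are made up):

```python
import sqlite3
import pandas as pd

# Illustrative in-memory database with two small tables.
conn = sqlite3.connect(":memory:")
conn.execute("create table prices (ticker text, ret real)")
conn.execute("create table info (ticker text, sector text)")
conn.executemany("insert into prices values (?, ?)",
                 [("AAA", 0.02), ("BBB", -0.01), ("CCC", 0.05)])
conn.executemany("insert into info values (?, ?)",
                 [("AAA", "Tech"), ("BBB", "Energy"), ("CCC", "Tech")])

# select ... from ... join ... on ... where ... order by
df = pd.read_sql(
    """
    select prices.ticker, ret, sector
    from prices
    join info on prices.ticker = info.ticker
    where sector = 'Tech'
    order by ret desc
    """,
    conn,
)
```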

Connect with Python

from sqlalchemy import create_engine
import pymssql
import pandas as pd

server = "mssql-82792-0.cloudclusters.net:16272"
username = "user"
password = "" # paste password between quote marks
database = "ghz"
string = "mssql+pymssql://" + username + ":" + password + "@" + server + "/" + database
conn = create_engine(string).connect()

Example: ROEQ and mom12m in 2021-12

data = pd.read_sql(
    """
    select ticker, date, ret, roeq, mom12m
    from data
    where date='2021-12'
    """, 
    conn
)
data = data.dropna()

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2399 entries, 0 to 2406
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ticker  2399 non-null   object 
 1   date    2399 non-null   object 
 2   ret     2399 non-null   float64
 3   roeq    2399 non-null   float64
 4   mom12m  2399 non-null   float64
dtypes: float64(3), object(2)
memory usage: 112.5+ KB

Fit a classification tree

from sklearn.tree import DecisionTreeClassifier

data['class'] = data.ret.transform(
  lambda x: pd.qcut(x, 3, labels=(0, 1, 2))
)
X = data[["roeq", "mom12m"]]
y = data["class"]

model = DecisionTreeClassifier(
  max_depth=2, 
  random_state=0
)
model.fit(X, y)

View the classification tree

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plot_tree(model)
plt.show()
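By default `plot_tree` labels split nodes with feature indices (`X[0]`, `X[1]`). Its `feature_names` and `class_names` arguments make the plot readable. A self-contained sketch using toy stand-in data rather than the database:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy stand-in for the roeq/mom12m data so the sketch runs stand-alone.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300)

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# feature_names and class_names label the nodes in the plot.
plot_tree(
    model,
    feature_names=["roeq", "mom12m"],
    class_names=["Low", "Medium", "High"],
    filled=True,
)
plt.show()
```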

Confusion matrix

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, X=X, y=y)
plt.show()

Predicted class probabilities

  • Three of the four leaves have a plurality of High, so all observations in those leaves get a prediction of High.
  • But the three leaves are not the same.
  • The fraction of Highs in a leaf is the probability that an observation in the leaf is High. The probabilities are
    • 53/69 = 77%
    • 315/695 = 45%
    • 409/1664 = 25%
    • 70/114 = 61%

Fit a regression tree

from sklearn.tree import DecisionTreeRegressor

X = data[["roeq", "mom12m"]]
y = data["ret"]

model = DecisionTreeRegressor(
  max_depth=2,
  random_state=0
)
model.fit(X, y)

View the regression tree

plot_tree(model)
plt.show()

Which are the low ROE, high MOM stocks?

subset = data[
  (data.roeq<=-0.181) & (data.mom12m>1.672)
]
subset
ticker date ret roeq mom12m class
788 CAR 2021-12 -0.244801 -1.259494 3.927780 0
1331 VTNR 2021-12 -0.079268 -0.628843 5.520000 0
1608 MVIS 2021-12 -0.292373 -0.220156 2.294372 0

Predicting ranks

data['rnk'] = data.ret.rank(pct=True)

X = data[["roeq", "mom12m"]]
y = data["rnk"]

model = DecisionTreeRegressor(
  max_depth=2,
  random_state=0
)
model.fit(X, y)
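Percentile ranks (`rank(pct=True)`) are bounded in (0, 1], so the target has no outliers and the squared-error splits cannot be dominated by a few extreme returns. A small illustration:

```python
import pandas as pd

# One extreme return; its rank is still just 1.0, not an outlier.
ret = pd.Series([-0.30, 0.01, 0.02, 0.05, 4.0])
rnk = ret.rank(pct=True)
```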

View the regression tree for ranks

plot_tree(model)
plt.show()

Predicting numerical classes

X = data[["roeq", "mom12m"]]
y = data["class"]

model = DecisionTreeRegressor(
  max_depth=2,
  random_state=0
)
model.fit(X, y)

View the regression tree for classes

plot_tree(model)
plt.show()
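Treating the 0/1/2 class labels as a numeric target means each leaf predicts the mean class label of its observations, so predictions fall anywhere in [0, 2] rather than being one of {0, 1, 2}: a value near 2 indicates a mostly-High leaf. A self-contained sketch on toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in: regress numeric class labels (0, 1, 2) on two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300).astype(float)

model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Each leaf predicts its mean class label, a value in [0, 2].
preds = model.predict(X)
```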